# Multimodal Reasoning

**InternVL3-8B-Instruct-GGUF** (unsloth) · Apache-2.0 · Image-to-Text · Transformers · 2,412 downloads · 1 like
InternVL3-8B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional overall performance, with strong multimodal perception and reasoning capabilities.

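Since several entries in this listing ship as GGUF quants, a minimal sketch of downloading and running one locally may help. It assumes `llama-cpp-python` and `huggingface_hub` are installed; the repo id is taken from the listing, the quant filename is hypothetical (check the repo's file list for actual names), and image input typically requires the repo's separate mmproj (vision projector) file, which this text-only smoke test omits.

```python
# Minimal sketch: download a GGUF quant and run a text-only prompt with
# llama-cpp-python. Filename below is hypothetical; see the repo's files.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="unsloth/InternVL3-8B-Instruct-GGUF",   # repo id from the listing
    filename="InternVL3-8B-Instruct-Q4_K_M.gguf",   # hypothetical quant file
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Describe what an MLLM is in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```
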
**InternVL3-14B-Instruct-GGUF** (unsloth) · Apache-2.0 · Image-to-Text · Transformers · 982 downloads · 1 like
InternVL3-14B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional multimodal perception and reasoning capabilities, supporting tasks such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**Bespoke-MiniChart-7B** (bespokelabs) · Image-to-Text · Safetensors · English · 437 downloads · 12 likes
A 7B-parameter open-source chart-understanding vision-language model developed by Bespoke Labs, outperforming closed-source models such as Gemini-1.5-Pro on chart question-answering tasks.

**Skywork-R1V2-38B** (Skywork) · MIT · Image-to-Text · Transformers · 1,778 downloads · 105 likes
Skywork-R1V2-38B is a state-of-the-art open-source multimodal reasoning model as of its release, demonstrating strong performance across multiple benchmarks with robust visual reasoning and text comprehension capabilities.

**ViCA2-Init** (nkkbr) · Apache-2.0 · Video-to-Text · Transformers · English · 30 downloads · 0 likes
ViCA2 is a multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.

**ViCA2-Stage2-OneVision-FT** (nkkbr) · Apache-2.0 · Video-to-Text · Transformers · English · 63 downloads · 0 likes
ViCA2 is a 7B-parameter multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.

**InternVL3-78B-hf** (OpenGVLab) · Other · Image-to-Text · Transformers · 40 downloads · 1 like
InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.

**SpaceThinker-Qwen2.5VL-3B** (remyxai) · Apache-2.0 · Image-to-Text · English · 490 downloads · 7 likes
SpaceThinker is a multimodal vision-language model that enhances spatial reasoning through test-time computation, excelling particularly at quantitative spatial reasoning and object-relationship analysis.

**InternVL3-9B-AWQ** (OpenGVLab) · MIT · Image-to-Text · Transformers · 214 downloads · 1 like
InternVL3-9B is a multimodal large language model from the InternVL3 series, featuring exceptional multimodal perception and reasoning capabilities. It supports application scenarios such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**InternVL3-8B-AWQ** (OpenGVLab) · Other · Image-to-Text · Transformers · 1,441 downloads · 3 likes
InternVL3-8B is an advanced multimodal large language model developed by OpenGVLab, featuring powerful multimodal perception and reasoning capabilities and supporting tool calling, GUI agents, industrial image analysis, 3D visual perception, and other emerging fields.

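For the AWQ checkpoints above, loading follows the usual Transformers pattern for InternVL-style repos, which ship custom modeling code. A minimal sketch, assuming `autoawq`, `accelerate`, and a recent `transformers` are installed; the repo id comes from the listing, and the actual multimodal chat helpers are defined by the repo's remote code, so consult the model card for a full inference example.

```python
# Minimal loading sketch for an AWQ-quantized InternVL3 checkpoint.
# InternVL repos ship custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVL3-8B-AWQ"  # repo id taken from the listing above

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.float16,   # AWQ kernels run in fp16
    trust_remote_code=True,
    device_map="auto",           # requires accelerate
)
# Image preprocessing and the chat() helper are defined by the repo's
# remote code; see the model card for the full multimodal example.
```
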
**TBAC-VLR1-3B-preview** (TencentBAC) · Apache-2.0 · Image-to-Text · Safetensors · English · 328 downloads · 11 likes
A multimodal language model fine-tuned by the Tencent PCG Basic Algorithm Center from Qwen2.5-VL-3B-Instruct, achieving state-of-the-art performance among models of similar scale on multiple multimodal reasoning benchmarks.

**InternVL3-9B-Instruct** (OpenGVLab) · MIT · Image-to-Text · Transformers · 220 downloads · 2 likes
InternVL3-9B-Instruct is the supervised fine-tuned version of the InternVL3 series, featuring powerful multimodal perception and reasoning capabilities and supporting various modalities such as images, text, and videos.

**InternVL3-8B-Instruct** (OpenGVLab) · Other · Image-to-Text · Transformers · 885 downloads · 2 likes
InternVL3-8B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional multimodal perception and reasoning capabilities, supporting functionality such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**VL-Reasoner-7B** (TIGER-Lab) · Apache-2.0 · Image-to-Text · Transformers · English · 126 downloads · 1 like
VL-Reasoner-7B is a multimodal reasoning model trained with the GRPO-SSR technique, demonstrating strong performance across multiple multimodal reasoning benchmarks.

**General-Reasoner-14B-Preview** (TIGER-Lab) · Apache-2.0 · Large Language Model · Transformers · English · 33 downloads · 3 likes
A multimodal reasoning model built on the Qwen2.5-14B base model and trained on the VisualWebInstruct-Verified dataset, supporting English-language tasks.

**SpaceQwen2.5-VL-3B-Instruct-GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 282 downloads · 0 likes
SpaceQwen2.5-VL-3B-Instruct is a multimodal vision-language model focused on spatial reasoning and embodied AI tasks.

**R01-Gemma-3-1b-it** (EpistemeAI) · Image-to-Text · Transformers · English · 17 downloads · 1 like
Gemma 3 is a lightweight open multimodal model family from Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.

**Cogito-v1** (cortexso) · Apache-2.0 · Large Language Model · 4,002 downloads · 2 likes
A powerful hybrid reasoning model from Deep Cogito, trained via Iterated Distillation and Amplification (IDA), excelling at programming, STEM, multilingual, and agentic applications.

**Space-Voice-Label-Detect-Beta** (devJy) · Apache-2.0 · Image-to-Text · Transformers · English · 38 downloads · 1 like
A fine-tuned version of the Qwen2.5-VL-3B model, trained 2x faster using Unsloth and Hugging Face's TRL library.

**Dreamer-7B** (osunlp) · Apache-2.0 · Image-to-Text · Transformers · English · 62 downloads · 3 likes
WebDreamer is a planning framework that enables efficient and effective planning for real-world web-agent tasks.

**3B-Curr-ReFT** (ZTE-AIM) · Apache-2.0 · Image-to-Text · 37 downloads · 3 likes
A multimodal large language model fine-tuned from Qwen2.5-VL using the Curr-ReFT method, significantly enhancing visual-language understanding and reasoning capabilities.

**STEVE-R1-7B-SFT-i1-GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 394 downloads · 0 likes
A weighted/imatrix quantized version of the Fanbin/STEVE-R1-7B-SFT model, suitable for resource-constrained environments.

**VideoMind-2B** (yeliudev) · BSD-3-Clause · Video-to-Text · 207 downloads · 1 like
VideoMind is a multimodal agent framework that enhances video reasoning by emulating human thought processes such as task decomposition, moment localization and verification, and answer synthesis.

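The roles the VideoMind description names (task decomposition, moment localization and verification, answer synthesis) amount to a simple control flow, sketched below with hypothetical function stand-ins rather than the project's actual API.

```python
# Illustrative sketch only: a planner -> grounder -> verifier -> answerer
# pipeline of the kind the VideoMind description outlines. All callables
# here are hypothetical stand-ins, not the project's real interface.

def answer_video_question(video, question, planner, grounder, verifier, answerer):
    # 1. Task decomposition: the planner decides which roles are needed.
    plan = planner(question)
    moment = None
    if "grounding" in plan:
        # 2. Moment localization: propose temporal segments relevant to the question.
        candidates = grounder(video, question)
        # 3. Verification: keep the candidate segment the verifier scores highest.
        moment = max(candidates, key=lambda seg: verifier(video, question, seg))
    # 4. Answer synthesis over the verified segment (or the full video).
    return answerer(video if moment is None else moment, question)
```
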
**Vintern-3B-R-beta** (5CD-AI) · MIT · Image-to-Text · Transformers · Multilingual · 1,841 downloads · 14 likes
Vintern-3B-R-beta is a multimodal large language model focused on complex image-based reasoning tasks, capable of decomposing reasoning into steps and effectively mitigating hallucinations.

**Llama-3.2-11B-Vision-Medical** (Varu96) · Apache-2.0 · Image-to-Text · Transformers · English · 25 downloads · 1 like
A model fine-tuned from unsloth/Llama-3.2-11B-Vision-Instruct, trained 2x faster using Unsloth and Hugging Face's TRL library.

**Sarashina2-Vision-14B** (sbintuitions) · MIT · Image-to-Text · Transformers · Multilingual · 192 downloads · 6 likes
Sarashina2-Vision-14B is a large Japanese vision-language model developed by SB Intuitions, combining Sarashina2-13B with the image encoder of Qwen2-VL-7B and achieving excellent performance on multiple benchmarks.

**Sarashina2-Vision-8B** (sbintuitions) · MIT · Image-to-Text · Transformers · Multilingual · 1,233 downloads · 4 likes
Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions, built from Sarashina2-7B and the image encoder of Qwen2-VL-7B, achieving excellent performance on multiple benchmarks.

**VisualThinker-R1-Zero** (turningpoint-ai) · MIT · Image-to-Text · Safetensors · English · 578 downloads · 6 likes
The first multimodal reasoning model to reproduce the 'aha moment' and increased response length using only a 2B model without supervised fine-tuning.

**Qwen2.5-VL-7B-Instruct-quantized.w8a8** (RedHatAI) · Apache-2.0 · Image-to-Text · Transformers · English · 1,992 downloads · 3 likes
A quantized version of Qwen2.5-VL-7B-Instruct supporting vision-text input and text output, optimized for inference efficiency via INT8 quantization of weights and activations (w8a8).

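Here w8a8 means both weights and activations are quantized to INT8, a format aimed at high-throughput serving. A minimal sketch with vLLM, which loads such compressed-tensors checkpoints natively; the repo id is assumed from the listing, and the prompt is a text-only smoke test (image inputs go through vLLM's multimodal inputs API; see the vLLM docs and the model card).

```python
# Minimal sketch: serving the INT8 w8a8 checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8",  # assumed repo id
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain INT8 w8a8 quantization."], params)
print(outputs[0].outputs[0].text)
```
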
**UI-TARS-2B-SFT** (bytedance-research) · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 5,792 downloads · 19 likes
UI-TARS is a next-generation native graphical user interface (GUI) agent model designed to seamlessly interact with GUIs through human-like perception, reasoning, and action capabilities.

**QVQ-72B-Preview-AWQ** (kosbu) · Other · Image-to-Text · Transformers · English · 532 downloads · 8 likes
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. This repository provides its AWQ 4-bit quantized version.

**LlamaV-o1** (omkarthawakar) · Apache-2.0 · Image-to-Text · Safetensors · English · 1,406 downloads · 93 likes
LlamaV-o1 is an advanced multimodal large language model designed for complex visual reasoning tasks, optimized through curriculum learning and demonstrating strong performance across diverse benchmarks.

**VideoLISA-3.8B** (ZechenBai) · Apache-2.0 · Video Segmentation · Safetensors · English · 247 downloads · 6 likes
A language-guided video reasoning-segmentation model built on LLaVA-Phi-3-mini-4k-instruct, focusing on object segmentation tasks in videos.

**NVLM-D-72B** (nvidia) · Image-to-Text · Transformers · English · 14.33k downloads · 769 likes
NVLM 1.0 is a series of cutting-edge multimodal large language models that achieve state-of-the-art results on vision-language tasks, comparable to leading proprietary and open-access models.

**SL-Persian-SER-with-GWO-and-HuBERT** (amirahmadian16) · Apache-2.0 · Large Language Model · Transformers · 20 downloads · 0 likes
An open-source model released under the Apache-2.0 license; judging by its name, a Persian speech emotion recognition (SER) model combining HuBERT features with Grey Wolf Optimization (GWO). The listing provides no further details.

**Emotion-LLaMA** (ZebangCheng) · Apache-2.0 · Large Language Model · Transformers · 213 downloads · 4 likes
Released under the Apache-2.0 license; judging by its name, a LLaMA-based model for emotion understanding. The listing provides no further details.

**Shotluck Holmes 3.1** (RichardLuo) · Apache-2.0 · Large Language Model · Transformers · 21 downloads · 2 likes

**NLLB-Uzbek-Russian** (sarahai) · Apache-2.0 · Large Language Model · Transformers · 54 downloads · 1 like
An open-source model released under the Apache-2.0 license; judging by its name, an NLLB-based Uzbek-Russian translation model.

**Finetuned-NLI-Provenance** (GuardrailsAI) · Apache-2.0 · Large Language Model · Transformers · 360 downloads · 3 likes

**Gazelle v0.2** (tincans-ai) · Apache-2.0 · Audio-Text-to-Text · Transformers · English · 90 downloads · 99 likes
Gazelle v0.2 is a joint speech-language model released by Tincans, supporting English.